AITopics

Neural Information Processing Systems Feb-10-2026, 17:30:08 GMT

3d0758f0b95e19abc68c1c8070d36510-Paper-Datasets_and_Benchmarks.pdf

agent, instruction, platform, (15 more...)

Country:

North America > United States > California (0.04)
North America > United States > Michigan (0.04)
Europe > Sweden > Skåne County > Malmö (0.04)

Industry: Leisure & Entertainment > Games > Computer Games (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
(8 more...)

Arana, Lukas, Etxaniz, Julen, Salaberria, Ander, Azkune, Gorka

Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

arXiv.org Artificial Intelligence Nov-13-2025

Current Multimodal Large Language Models (MLLMs) exhibit very strong performance on several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque-instructed backbone LLM is not required to obtain a strong MLLM in Basque. Our results pave the way to develop MLLMs for other low-resource languages by openly releasing our resources.

benchmark, large language model, machine learning, (21 more...)

2511.09396

Country:

North America > United States (0.46)
Europe (0.46)
Asia (0.46)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial Intelligence Nov-7-2025

Expert Evaluation of LLM World Models: A High-$T_c$ Superconductivity Case Study

Guo, Haoyu, Tikhanovskaya, Maria, Raccuglia, Paul, Vlaskin, Alexey, Co, Chris, Liebling, Daniel J., Ellsworth, Scott, Abraham, Matthew, Dorfman, Elizabeth, Armitage, N. P., Feng, Chunhan, Georges, Antoine, Gingras, Olivier, Kiese, Dominik, Kivelson, Steven A., Oganesyan, Vadim, Ramshaw, B. J., Sachdev, Subir, Senthil, T., Tranquada, J. M., Brenner, Michael P., Venugopalan, Subhashini, Kim, Eun-Ah

Large Language Models (LLMs) show great promise as a powerful tool for scientific literature exploration. However, their effectiveness in providing scientifically accurate and comprehensive answers to complex questions within specialized domains remains an active area of research. Using the field of high-temperature cuprates as an exemplar, we evaluate the ability of LLM systems to understand the literature at the level of an expert. We construct an expert-curated database of 1,726 scientific papers that covers the history of the field, and a set of 67 expert-formulated questions that probe deep understanding of the literature. We then evaluate six different LLM-based systems for answering these questions, including both commercially available closed models and a custom retrieval-augmented generation (RAG) system capable of retrieving images alongside text. Experts then evaluate the answers of these systems against a rubric that assesses balanced perspectives, factual comprehensiveness, succinctness, and evidentiary support. Among the six systems, the two using RAG on curated literature outperformed existing closed models across key metrics, particularly in providing comprehensive and well-supported answers. We discuss promising aspects of LLM performance as well as critical shortcomings of all the models. The set of expert-formulated questions and the rubric will be valuable for assessing expert-level performance of LLM-based reasoning systems.

large language model, machine learning, natural language, (20 more...)

2511.03782

Country: North America > United States (0.95)

Genre: Research Report (1.00)

Industry:

Energy (0.68)
Materials (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Karbasi, Kia, Hong, Kevin, Samadi, Mohammad Amin, Pottie, Gregory

Multi-Agent Collaborative Framework For Math Problem Generation

arXiv.org Artificial Intelligence Nov-7-2025

Automatic question generation (AQG) for mathematics education remains an elusive goal for Intelligent Tutoring Systems and educators. While pre-trained transformer-based language models have significantly advanced natural language generation, they often struggle to precisely control problem complexity and cognitive demands. In this paper, we introduce a collaborative multi-agent framework as a novel method of incorporating inference-time computation into AQG. This approach leverages multiple agents that iteratively refine generated question-answer pairs to better balance complexity and cognitive demand. We evaluate the generated questions on five meta-evaluation criteria (relevance, importance, clarity, difficulty matching, and answerability) to assess the system's ability to control the required complexity and quality of the questions. Preliminary evaluations show that this collaborative multi-agent framework elevates the quality of generated educational content by fostering a more nuanced balance between cognitive challenge and clarity. These promising outcomes suggest that integrating collaborative multi-agent workflows can yield more controlled, pedagogically valuable content that can help advance automated educational content generation and adaptive learning environments.

agent, artificial intelligence, natural language, (17 more...)

doi: 10.5281/zenodo.15870246

2511.03958

Country: North America > United States > California > Los Angeles County > Los Angeles (0.29)

Genre:

Research Report > New Finding (0.46)
Research Report > Promising Solution (0.34)

Industry: Education > Educational Technology > Educational Software > Computer Based Training (0.55)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.75)

arXiv.org Artificial Intelligence Oct-27-2025

A Diagnostic Benchmark for Sweden-Related Factual Knowledge

Kunz, Jenny

Many Swedish benchmarks are translated US-centric benchmarks, and therefore not suitable for testing knowledge that is particularly relevant, or even specific, to Sweden. We therefore introduce a manually written question-answering benchmark specifically targeted to Sweden-related personalities and events, many of which receive very limited coverage in international media. Our annotators drew inspiration from a popular radio program featuring public figures from culture and media, as well as major sports events in Sweden. The dataset can be used to measure factual recall across models of varying sizes and degrees of Swedish coverage, and, because it contains English translations, it also allows probing cross-lingual factual consistency. Using the dataset, we find that smaller models with stronger Swedish coverage perform comparably to a three times larger multilingual model in recalling Sweden-related facts. We also observe that continued pre-training on Swedish generally improves factual knowledge but also leads to forgetting of a part of the previously known information. These results demonstrate the dataset's potential as a diagnostic tool for studying language adaptation and knowledge retention in multilingual models.

artificial intelligence, large language model, natural language, (17 more...)

2510.2136

Country:

Europe > Sweden (1.00)
Europe > Austria > Vienna (0.14)

Genre: Research Report > New Finding (0.88)

Industry:

Media (0.93)
Leisure & Entertainment > Sports > Soccer (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.32)

Jubair, Sheikh, Omayrah, Arwa, Alshammari, Amal, Althnian, Alhanoof, Alothaimen, Abdulhamed, Alzahrani, Norah A., Alzaidi, Shahad D., Al-Twairesh, Nora, Al-Thubaity, Abdulmohsen

LC-Eval: A Bilingual Multi-Task Evaluation Benchmark for Long-Context Understanding

arXiv.org Artificial Intelligence Oct-21-2025

Recent advancements in Large Language Models (LLMs) have demonstrated sophisticated capabilities, including the ability to process and comprehend extended contexts. These emergent capabilities necessitate rigorous evaluation methods to effectively assess their performance in long-context understanding. In this paper, we present LC-Eval, a bilingual, multi-task evaluation benchmark designed to evaluate long-context understanding in English and Arabic, targeting context lengths ranging from 4k to over 128k tokens. LC-Eval introduces four novel and challenging tasks: multi-document question answering, bilingual question answering, claim verification within a paragraph, and multiple-choice questions based on long contexts. These tasks are designed to assess LLMs' abilities in deep reasoning, document comprehension, information tracing, and bilingual information extraction and understanding. The benchmark includes datasets in both Arabic and English for each task, allowing for a comparative analysis of their performance across different text genres. Evaluations were conducted on both open-weight and closed LLMs, with results indicating that LC-Eval presents significant challenges. Even high-performing models, such as GPT-4o, struggled with certain tasks, highlighting the complexity and rigor of the benchmark.

large language model, machine learning, natural language, (18 more...)

2510.16783

Country:

North America > United States (0.93)
Europe (0.92)

Genre:

Research Report (0.64)
Questionnaire & Opinion Survey (0.49)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems Oct-10-2025, 18:04:23 GMT

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

Seeking answers to questions within long scientific research articles is a crucial area of study that aids readers in quickly addressing their inquiries.

dataset, figure and table, l3score, (16 more...)

Country:

Asia > Singapore (0.04)
Asia > Indonesia > Bali (0.04)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.96)
(2 more...)

Neural Information Processing Systems Oct-10-2025, 06:49:07 GMT

cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers

Besides textual paragraphs, researchers rely on various modalities to describe research methods. Figures convey information about concepts developed throughout the paper.

computational linguistic, dataset, equation, (15 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(15 more...)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.46)